Data analytics has played a major role in the entertainment industry for a while now. The need to understand human behavior in order to predict a movie's success has driven the film industry to gather and analyze data, both to optimize business decisions and to gain a slight edge within the industry. Unfortunately, understanding and quantifying the factors that directly influence a film's success may not always be enough to predict its outcome.
Given the considerable investment required to create and market a movie, the industry is still searching for a formula to minimize risk. So far, variables like:
* Release Dates
* Social Media Feedback
* Current Market Trends
to name a few, are the main focus of these analyses. Understanding when to release a film is crucial to its success: you have to account for holidays, other big events, the day of the week, competing releases, the time of the month, and so on. Social media feedback is highly influential when determining the success of a film. Tracking reactions, comments, and even likes across platforms, whether about a rumor, an announcement, or general expectations, helps to predict how a movie will perform.
Platforms like IMDB and Rotten Tomatoes have helped by trying to quantify a movie's success, weighing various factors against some type of scoring scale over the years.
The goal of this project is to use past IMDB data and IMDB scores to understand movie performance, and to try to build a model that predicts a movie's success. Personally, I would try to gather data reflecting human preferences and attitudes toward the movie, investment amounts, and anything regarding release timing. For this project, though, we will use a data set drawn from IMDB's database covering about 5,000 movies. The goal is to model the relationship between the variables described above and the scoring scale used by IMDB. For example, the investment in each movie should probably have a positive relationship with its success rate (IMDB score), as should the number of likes it accumulates. It would be interesting to determine the relationship between release date and IMDB score, but we also have other variables such as duration and language. Duration is hard to reason about, since whether viewers mind a movie being long or short depends on the type of movie. Still, I would assume that the longer the movie, the lower the score and the worse the reviews.
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
#regression packages
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
#lasso regression
from sklearn import linear_model
#f_regression (feature selection)
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
# recursive feature selection (feature selection)
from sklearn.feature_selection import RFE
# random forest regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
#import decisiontreeclassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
#import logisticregression classifier
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
#import knn classifier
from sklearn.neighbors import KNeighborsClassifier
#for validating your classification model
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
# feature selection
from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# random forest classifier
from sklearn.ensemble import RandomForestClassifier
import statsmodels.api as sm
from statsmodels.formula.api import ols
# clustering
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
pd.set_option('display.max_colwidth', None)  # -1 is deprecated in newer pandas
pd.set_option('display.max_columns', None)
df = pd.read_csv("data/movie_data.csv")
df.head()
df.info()
df.isnull().sum()
df.isnull().sum().plot(kind='bar')
df.describe()
movie = df.copy()  # copy so later edits don't mutate the original frame
movie.shape
# numeric_only skips string columns (required in pandas >= 2.0)
corr = pd.DataFrame(movie.corr(numeric_only=True)['imdb_score'].drop('imdb_score'))
corr.sort_values(['imdb_score'], ascending=False)
The variables that I will be focusing on for this project are:
- Number of people who voted for each movie
- Number of critical reviews
- Number of user reviews for each movie
- Duration of each movie
- Number of facebook likes for each movie
- Director facebook likes
- Gross revenue, budget, and profit for each movie
These seem to be the most influential variables in terms of their impact on a good IMDB score, or at least ones I think should have an impact. We will analyze each to understand its effect on, and relationship to, the score.
movie.info()
df.groupby('content_rating').size().head()
df.groupby('genres').size().head()
Variables like the movie link and plot keywords would be tedious to work with. The link is just the address of the movie's page on the IMDB website; it is standard for every movie and I assume (almost surely) has no impact on the score. Plot keywords are just words associated with the movie; they inflict no harm on, and bring no benefit to, a movie's score. Dropping them is a simple way to clean the data set.
movie = movie.drop(['movie_imdb_link', 'plot_keywords'], axis=1)
movie.head()
movie.info()
movie.hist(figsize=(8,8))
We can see that the distribution of imdb_score is close to symmetric, suggesting a roughly normal distribution around the mean. Apparently not many movies score below a 4, and not many reach an 8-8.5.
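One way to back up this visual impression is a normality test from scipy.stats (already imported above); a sketch on synthetic scores standing in for movie['imdb_score']:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic scores standing in for movie['imdb_score']
scores = np.clip(rng.normal(loc=6.4, scale=1.1, size=500), 1, 10)

# D'Agostino-Pearson test: a small p-value would argue against normality
stat, p = stats.normaltest(scores)
print(stat, p)
```

On the real column you would pass movie['imdb_score'].dropna() instead of the synthetic array.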
movie['imdb_score'].describe()
movie.info()
language_dist = movie.groupby('language').size() / len(movie)
language_dist.plot(kind='bar')
movie.isnull().sum()
There are missing values in the data set, but dropping every affected row would cost us a lot of information, which for the purposes of this project would be counterproductive. We are only looking for the impact of different movie variables on their IMDB scores.
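If we ever did want to fill the gaps rather than drop rows (or leave them in place), median imputation is a minimal option; a sketch on a toy frame standing in for the movie data:

```python
import numpy as np
import pandas as pd

# toy frame standing in for the movie data; 'budget' has a gap
movie = pd.DataFrame({
    "budget": [1_000_000, np.nan, 3_000_000],
    "imdb_score": [6.5, 7.2, 5.8],
})

# median imputation keeps every row while limiting outlier influence
movie["budget"] = movie["budget"].fillna(movie["budget"].median())
print(movie["budget"].isnull().sum())  # 0
```

The trade-off: imputed rows no longer carry real budget information, which could dilute any budget-score relationship.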
I would like to add a profit column, to see the impact of the investment and its performance on the movie's score.
movie['profit'] = movie['gross'] - movie['budget']
movie.head()
corr = pd.DataFrame(movie.corr(numeric_only=True)['imdb_score'].drop('imdb_score'))
corr.sort_values(['imdb_score'], ascending=False)
Profit still has a very low correlation with the IMDB score. This may confirm my assumption that the most influential variables are those representing human reaction to each movie.
movie.shape
I would also like to check for duplicates in the data set, because duplicates would distort our analysis.
movie.duplicated().sum()
There are 45 duplicates in the data set.
movie.loc[movie.duplicated(), : ].head()
movie = movie.drop_duplicates(keep='first')
movie.info()
movie.shape
Clean data correlation analysis.
corr = pd.DataFrame(movie.corr(numeric_only=True)['imdb_score'].drop('imdb_score'))
corr.sort_values(['imdb_score'], ascending=False)
*Remember that this was the most influential variable (41% positive correlation).
sns.jointplot(x="imdb_score", y="num_voted_users", data=movie)
movie.groupby('movie_title')['num_voted_users'].sum().sort_values(ascending=False).head(10)
movie.groupby('movie_title')['num_voted_users'].sum().sort_values(ascending=False).head(10).plot(kind='barh')
plt.xticks(rotation=90)
movie.groupby('imdb_score')['num_voted_users'].mean().sort_values(ascending=False).head()
This is the average number of user votes a movie receives at each of the top IMDB scores.
sns.regplot(x="imdb_score", y="num_critic_for_reviews", data=movie)
movie.groupby('movie_title')['num_critic_for_reviews'].mean().sort_values(ascending=False).head()
criticsandscore = movie[['imdb_score', 'num_critic_for_reviews']]
criticsandscore.sort_values('num_critic_for_reviews', ascending=False).head()
The best IMDB scores with the average number of critic reviews behind them. It is clear that a good number of critic reviews (interactions) is needed in order to get a good IMDB score.
sns.regplot(x="imdb_score", y="num_user_for_reviews", data=movie)
movie.groupby('imdb_score')['num_user_for_reviews'].mean().sort_values(ascending=False).head()
The average number of user reviews behind each of the top IMDB scores.
sns.jointplot(x="imdb_score", y="duration", data=movie, kind="reg");
movie.groupby('imdb_score')['duration'].mean().sort_values(ascending=False).head()
movie.groupby('imdb_score')['duration'].mean().sort_values(ascending=False).plot(kind='bar', figsize=(15,8))
The mean duration of movies with high IMDB Scores.
top10 = movie.sort_values('profit', ascending=False).head(10)
top10
plt.figure(figsize=(15,10))
plt.bar(top10.movie_title, top10.profit)
plt.xlabel('Movie Titles')
plt.ylabel('Profit')
plt.title('Top 10 Profitable Movies')
plt.xticks(rotation=90)
movie.groupby('director_name').size().sort_values(ascending=False).head(10)
movie.groupby('director_name')['director_facebook_likes'].sum().sort_values(ascending=False).head(10)
movie.groupby('imdb_score')['budget'].mean().sort_values(ascending=False).head(10).plot(kind='bar')
sns.lmplot(x="imdb_score", y="budget", data=df)
title_budget = movie[['movie_title', 'budget']]
title_budget.sort_values('budget', ascending=False).head(1)
movie[movie['movie_title'].str.contains("Host", na=False) & movie['director_name'].str.contains("Joon-ho", na=False)]  # na=False guards against missing names
movie2 = movie[['movie_title', 'imdb_score']]
movie2.sort_values('imdb_score', ascending=False).head()
movie.groupby('country')['movie_title'].size().sort_values(ascending=False).head(10).plot(kind='pie', figsize=(10,10), autopct='%1.2f%%')
It is clear that almost 80% of the movies come from the USA, followed by the UK with almost 9.5%, and then other countries like France, Canada, Australia, Germany, Italy, etc., with lower percentages.
movie.groupby('country')['imdb_score'].mean().head(20).sort_values(ascending=False).plot(kind='bar')
The country with the highest average imdb score is Egypt.
movie.groupby('imdb_score')['director_facebook_likes'].mean().sort_values(ascending=False).head()
movie.corr(numeric_only=True)
plt.figure(figsize=(8, 8))
sns.heatmap(movie.corr(numeric_only=True))
sns.regplot(x="imdb_score", y="title_year", data=movie)
corr = pd.DataFrame(movie.corr(numeric_only=True)['imdb_score'].drop('imdb_score'))
corr.sort_values(['imdb_score'], ascending=False)
plt.figure(figsize=(1, 5))
sns.heatmap(corr)
After a thorough analysis, I'd like to summarize what we've learned about the behavior of each variable and its influence on a good IMDB score. First of all, I didn't expect a movie's budget to have such a weak correlation with its performance (in this case, IMDB score). Maybe it has to do with the way IMDB rates each movie, giving more weight to the impact of each movie on people and critics. It appears that the more human interaction a movie generates, the higher the IMDB score. This is reflected in the high positive correlations for variables like the number of users who voted for the movie, the number of critic reviews, the number of user reviews, and others like the movie's Facebook likes or the director's likes. Duration surprised me as well. I would have assumed that few movies could sustain a 2-2.5 hour runtime while keeping people hooked and earning positive reviews and votes, and that the longer people remained in the theater, the more tired and annoyed they would get. Still, there is a 26% positive correlation between a movie's duration and its IMDB score. Other variables like the actors' Facebook likes don't show as strong a correlation (<10%), but I suppose they help somewhat. The only two variables that show a negative relationship with IMDB score are the number of faces in the poster and the year the title was released. The latter basically means that older movies score higher, which could make sense considering they have had more time to accumulate reviews and votes. Still, the impact of facenumber_in_poster is minimal, close to meaningless.
As I completed this project, I figured there should be a better way to predict a movie's success. IMDB rates a movie's performance after it comes out, which means a lot of weight goes to how people react to it. As seen in the correlation analyses and in every visualization, the variables with the strongest relationships to the IMDB score are those involving some type of human reaction to the movie: a review from a user or a critic, the number of votes a movie received, or even the rating (Facebook likes) of its director. All of these variables only become reliable after the movie comes out. But what about the things you could do before releasing a title to make it successful? You could, of course, study who the top directors are and their average IMDB scores, and likewise for actors, but in our analysis those factors don't carry as much weight. The same goes for a variable like budget. There may well be movies with average or even low budgets that succeed, but I would guess the resources you can gather to make a movie successful matter more than a correlation of just 3% suggests.
Looking at the correlations for the variables we studied, we can see that the most influential ones involve some type of human interaction with the movie itself. For example, the number of users who voted for a movie has a 41% positive correlation with its IMDB score: the more votes a movie gets, the better its rating tends to be. The same goes for critic and user reviews, each carrying about a 30% positive correlation with the IMDB score; they prove to be key aspects to consider when releasing a new title. The duration of the movie must also be carefully chosen. Duration has a 26% positive correlation with IMDB score, so you need to stay within a safe range of minutes that keeps everyone engaged while still developing a good story. Other variables like gross revenue and profit carry some weight as well, but I suspect they mostly reflect attention to the variables just discussed. With correlations ranging from 3% to 20%, it is still important that a movie receive enough investment to create the best-quality content possible for its audience, and enough revenue to cover that investment and remain profitable.
So how can we use what we now know to predict, and improve, a movie's performance? Perhaps start by releasing polls to understand people's interest in genres and what they expect. Perhaps have people actually read the reviews of past movies to gather feedback and insights on how to improve. After all, it's all about the customer: if you don't give customers what they want, your movie will not be successful in any respect. Companies in the industry should constantly watch for trends and customer preferences so they can release movies that will draw more viewers. The endgame is giving more importance to the variables we studied, from the reviews and votes of critics and users to the Facebook likes of a director or actor, because in the end, that is the ultimate feedback anyone in the movie industry can receive.
I don't have references to every webpage or other examples that I've used in this project, but I still want to give credit to the people who have helped me finish this in the best way possible.
As always, stackoverflow was helpful for figuring out different pieces of code, or at least new ways to write them. www.stackoverflow.com
I also used the example provided by Dr. Chae to guide me through organizing my ideas and the project. rstudio-pubs-tatic.s3.amazonaws.com/342210_7c8d57cfdd784cf58dc077d3eb7a2ca3.html
Of course, I used past labs and homework to help me complete every aspect of this project, along with discussing a few ideas with friends and having someone else proofread and judge my work.
movie.info()
mov = movie.drop(['director_name', 'actor_1_name', 'actor_3_name', 'facenumber_in_poster', 'language', 'country', 'aspect_ratio', 'title_year', 'actor_2_name', 'movie_title'], axis=1)
mov.info()
mov.dropna(inplace=True)
mov.info()
mov['category'] = 1
# .loc avoids pandas chained-assignment warnings
mov.loc[(mov['imdb_score'] >= 4) & (mov['imdb_score'] <= 6), 'category'] = 2
mov.loc[(mov['imdb_score'] > 6) & (mov['imdb_score'] <= 8), 'category'] = 3
mov.loc[mov['imdb_score'] > 8, 'category'] = 4
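The same four-way binning can be written more compactly with pandas' pd.cut (a sketch; note that a score of exactly 4 lands in a different bin than under the explicit comparisons above):

```python
import pandas as pd

scores = pd.Series([3.5, 4.0, 5.9, 6.1, 8.4])
# right-closed bins: (0, 4], (4, 6], (6, 8], (8, 10]
# note: a score of exactly 4 falls in the first bin here, whereas the
# explicit comparisons above assign it to category 2
category = pd.cut(scores, bins=[0, 4, 6, 8, 10], labels=[1, 2, 3, 4]).astype(int)
print(category.tolist())  # [1, 1, 2, 3, 4]
```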
mov.head()
# convert genres to unique values and columns.
mov_dummies = mov.copy()  # copy so edits here don't alias back onto mov
# one-hot encode the pipe-separated genres directly
mov_dummies_2 = mov_dummies['genres'].str.get_dummies(sep='|')
mov_dummies_2.head(2)
mov_dummies_2.head(2)
mov_dummies = mov_dummies.join(mov_dummies_2)
mov_dummies.head()
content_dummies = pd.get_dummies(mov_dummies['content_rating'])
mov_dummies = mov_dummies.join(content_dummies)
mov_dummies.head(2)
color_dummies = pd.get_dummies(mov_dummies['color'])
mov_dummies = mov_dummies.join(color_dummies)
mov_dummies.head(2)
mov = mov.drop(['color', 'genres', 'content_rating'], axis=1)
mov.info()
corr1 = pd.DataFrame(mov.corr()['imdb_score'].drop('imdb_score'))
corr1.sort_values(['imdb_score'], ascending = False)
plt.figure(figsize=(1, 8))
sns.heatmap(corr1)
mov[['duration', 'imdb_score']].corr().plot()
mov[['num_voted_users', 'imdb_score']].corr().plot()
mov[['num_critic_for_reviews', 'imdb_score']].corr().plot()
We can clearly see that num_voted_users, duration, and num_critic_for_reviews are among the variables most highly correlated with imdb_score. Their correlation coefficients are all positive, meaning these values tend to move together with imdb_score; to achieve a high imdb_score, high values on these variables are preferred. (Correlation alone does not establish that increasing them causes the score to rise.)
We will further analyze these assumptions.
corr_dummies = pd.DataFrame(mov_dummies.corr(numeric_only=True)['imdb_score'].drop('imdb_score'))
corr_dummies.sort_values(['imdb_score'], ascending=False)
plt.figure(figsize=(1, 8))
sns.heatmap(corr_dummies)
mov_dummies[['Drama', 'imdb_score']].corr().plot()
In this correlation analysis, we found a few impactful correlations with imdb_score among the movie genres, such as Drama and Comedy. Still, none of them seem extremely impactful, and including them in our model might add complexity that is hard to explain and understand.
# build a model
y = mov['imdb_score']
X = mov.drop(['imdb_score', 'category'], axis =1)
Remember that we created the category column from imdb_score itself, so adding that column to our regression models would be redundant and would leak the target into the model.
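To see why, note that category is a deterministic function of imdb_score; any model can exploit such a leaked column. A sketch with synthetic data, where 'category' is derived from the target by the same 4/6/8 cut points:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
y = pd.Series(rng.uniform(1, 10, size=200))          # stand-in for imdb_score
X_ok = pd.DataFrame({"feat": rng.normal(size=200)})  # an unrelated feature
# 'category' is derived from y itself, so it leaks the target
X_leak = X_ok.assign(category=np.digitize(y, [4, 6, 8]) + 1)

r2_ok = LinearRegression().fit(X_ok, y).score(X_ok, y)
r2_leak = LinearRegression().fit(X_leak, y).score(X_leak, y)
print(r2_leak > r2_ok)  # True: the leaked column inflates the fit
```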
lr = lm.LinearRegression()
rfe = RFE(lr, n_features_to_select=2)
rfe_y = rfe.fit(X,y)
print("Features sorted by their rank:")
print(sorted(zip(rfe.ranking_, X.columns)))
sns.lmplot(x='duration', y='imdb_score', data=mov)
sns.lmplot(x='num_critic_for_reviews', y='imdb_score', data=mov)
We rely on this feature selection method to confirm the importance and rankings of the different variables we are considering for our models. According to our RFE, the most critical features include:
- Duration
- Num_critic_for_reviews
- Num_user_for_reviews
- Actor_3_facebook_likes
and so on...
We will also be including num_voted_users since it ranks pretty well on our past correlation analyses.
model1 = lm.LinearRegression()
model1.fit(X,y)
model1_y = model1.predict(X)
print('Coefficients: ', model1.coef_)
print('y-intercept ', model1.intercept_)
pd.DataFrame(list(zip(X.columns, model1.coef_)))
coef = ["%.3f" % i for i in model1.coef_]
xcolumns = list(X.columns)
list(zip(xcolumns, coef))
print("mean square error: ", mean_squared_error(y, model1_y))
print("variance or r-squared: ", explained_variance_score(y, model1_y))
plt.subplots()
plt.scatter(y, model1_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4) #dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
The multiple regression yielded a fairly high MSE of about 0.75 (in squared score units) and a somewhat low R-squared of about 0.32.
Still, we can see that the most impactful features in terms of their coefficients include several of the ones flagged earlier by the RFE ranking: num_critic_for_reviews, duration, and num_user_for_reviews.
# RIDGE
rig = lm.Ridge(alpha=0.1) #higher alpha (penality parameter), fewer predictors
rig.fit(X, y)
rig_y = rig.predict(X)
print('Coefficients: ', rig.coef_)
print('y-intercept ', rig.intercept_)
coef = ["%.3f" % i for i in rig.coef_]
xcolumns = list(X.columns)
list(zip(xcolumns, coef))
sorted(zip(coef, xcolumns), reverse=True)
print("mean square error: ", mean_squared_error(y, rig_y))
print("variance or r-squared: ", explained_variance_score(y, rig_y))
Our Ridge regressor yielded virtually the same results (MSE ≈ 0.75, R-squared ≈ 0.325), which is expected given the small penalty (alpha=0.1).
The same features are considered important.
model2 = lm.Lasso(alpha=0.1) #higher alpha (penality parameter), fewer predictors
model2.fit(X, y)
model2_y = model2.predict(X)  # fixed: the original cell predicted with model1 here
print('Coefficients: ', model2.coef_)
print('y-intercept ', model2.intercept_)
coef = ["%.3f" % i for i in model2.coef_]
xcolumns = list(X.columns)
list(zip(xcolumns, coef))
sorted(zip(coef, xcolumns), reverse=True)
a = list(zip(xcolumns, coef))
coef_df = pd.DataFrame(a)  # renamed so we don't overwrite the raw-data `df`
coef_df.sort_values(1, ascending=False)
print("mean square error: ", mean_squared_error(y, model2_y))
print("variance or r-squared: ", explained_variance_score(y, model2_y))
Again, our regularization method (Lasso) returns a high MSE and a somewhat low R-squared: about 0.75 and 0.325 respectively.
regr = RandomForestRegressor(random_state=0)
regr.fit(X, y)
regr_predicted = regr.predict(X)
print("mean square error: ", mean_squared_error(y, regr_predicted))
print("variance or r-squared: ", explained_variance_score(y, regr_predicted))
sorted(zip(regr.feature_importances_, X.columns))
plt.subplots()
plt.scatter(y, regr_predicted)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4) #dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
Now we arrive at our best fit yet. The random forest regressor (a model highly recommended by our instructor) yielded our highest R-squared, almost 0.91, with a low MSE of about 0.10.
This means the model explains about 91% of the variance, but keep in mind these are in-sample (training) metrics, so the fit is almost certainly optimistic compared to performance on unseen movies.
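To quantify how optimistic the in-sample fit is, one could compare it with a cross-validated score; a sketch on synthetic regression data standing in for our X and y:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# synthetic stand-in for our feature matrix and imdb_score target
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=0)

regr_demo = RandomForestRegressor(random_state=0)

# in-sample R^2: fit and score on the same rows (optimistic)
in_sample = regr_demo.fit(X_demo, y_demo).score(X_demo, y_demo)
# 5-fold cross-validated R^2: each fold is scored on held-out rows
cv_r2 = cross_val_score(regr_demo, X_demo, y_demo, cv=5, scoring="r2").mean()

print(in_sample > cv_r2)  # True: the honest estimate is lower
```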
y_2 = mov['imdb_score']
X_2 = mov.drop(['budget','duration','num_voted_users', 'imdb_score', 'category'], axis =1)
regr = RandomForestRegressor(random_state=0)
regr.fit(X_2, y_2)
regr_predicted = regr.predict(X_2)
print("mean square error: ", mean_squared_error(y_2, regr_predicted))
print("variance or r-squared: ", explained_variance_score(y_2, regr_predicted))
sorted(zip(regr.feature_importances_, X_2.columns))
y = mov['category']
X = mov.drop(['category', 'imdb_score'], axis=1)
Now we use our category column as our y-value in order to classify IMDB scores with our models. I also developed a few graphs to show some basic profiling in terms of specific variables.
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=4)  # ask for the four best attributes
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
print("------------------------")
print(" ")
rfe_class_ranks = rfe.ranking_
print(sorted(zip(rfe_class_ranks, X.columns)))
X_logistic = mov[['duration', 'num_critic_for_reviews', 'num_user_for_reviews', 'num_voted_users']]
print(X_logistic.head())
X_train, X_test, y_train, y_test = train_test_split(X_logistic, y, test_size=0.3, random_state=0)
lr = LogisticRegression()
lr.fit(X_train, y_train)
#Model evaluation
print(metrics.accuracy_score(y_test, lr.predict(X_test)))
print(metrics.confusion_matrix(y_test, lr.predict(X_test)))
print(metrics.classification_report(y_test, lr.predict(X_test)))
# multiclass AUC needs class probabilities rather than hard labels
print(metrics.roc_auc_score(y_test, lr.predict_proba(X_test), multi_class='ovr'))
We used a feature selection method (RFE) again to confirm that the features we consider significant keep their level of impact on our y-value. We can still observe that duration, num_critic_for_reviews, and num_user_for_reviews rank highest, and the logistic regression built on these features has an accuracy of 65%.
# split validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Initialize DecisionTreeClassifier()
dt = DecisionTreeClassifier()
# Train a decision tree model
dt.fit(X_train, y_train)
print(len(X_train), len(y_train))
print(len(X_test), len(y_test))
print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
# multiclass AUC needs class probabilities rather than hard labels
print(metrics.roc_auc_score(y_test, dt.predict_proba(X_test), multi_class='ovr'))
import scikitplot as skplt
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test), y_pred=dt.predict(X_test))
plt.show()
tree.export_graphviz(dt, out_file='data/decisiontree2.dot', feature_names=X.columns)
from IPython.display import Image
Image("data/decisiontree2.png")
Our decision tree has 67% accuracy. However, the fully grown tree is too complicated to be practical.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
dt_simple = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
dt_simple.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, dt_simple.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt_simple.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt_simple.predict(X_test)))
print("--------------------------------------------------------")
# multiclass AUC needs class probabilities rather than hard labels
print(metrics.roc_auc_score(y_test, dt_simple.predict_proba(X_test), multi_class='ovr'))
We reduce the max depth of our decision tree to make the previous model more practical, and the accuracy increases to 71.8%.
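Rather than hand-picking max_depth=3, the depth could be chosen by cross-validation; a sketch with GridSearchCV on synthetic data standing in for our features and categories:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for our feature matrix and category labels
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=4, n_classes=3,
                                     random_state=0)

# search depth and leaf size jointly, scored by 5-fold CV accuracy
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 3, 4, 5, 6],
                                "min_samples_leaf": [1, 5, 10]},
                    cv=5, scoring="accuracy")
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```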
tree.export_graphviz(dt_simple, out_file='data/decisiontree_simple2.dot', feature_names=X.columns)
from IPython.display import Image
Image("data/decisiontree_simple2.png")
k_range = range(1, 10)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores.append(np.mean(cross_val_score(knn, X, y, cv=10, scoring='accuracy')))
# plot the K values (x-axis) versus the 10-fold CV score (y-axis)
plt.figure()
plt.plot(k_range, scores)
plt.xlabel('k value')
plt.ylabel('accuracy')
from sklearn.model_selection import GridSearchCV  # grid_search module was removed from sklearn
knn = KNeighborsClassifier()
k_range = range(1, 10)
param_grid = dict(n_neighbors=k_range)
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)
# check the results of the grid search (grid_scores_ is now cv_results_)
grid_mean_scores = grid.cv_results_['mean_test_score']
plt.figure()
plt.plot(k_range, grid_mean_scores)
plt.show()
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
# evaluate the model by splitting into train and test sets & develop knn model (name it as knn)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# initialize KNeighborsClassifier() and train a KNN Model
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
# multiclass AUC needs class probabilities rather than hard labels
print(metrics.roc_auc_score(y_test, knn.predict_proba(X_test), multi_class='ovr'))
Our KNN model is one of our weakest, with an accuracy of 52%.
clf = RandomForestClassifier(n_estimators=20) #building 20 decision trees
clf=clf.fit(X_train, y_train)
clf.score(X_test, y_test)
print(metrics.accuracy_score(y_test, clf.predict(X_test)))  # overall accuracy
print(metrics.confusion_matrix(y_test, clf.predict(X_test)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
print("Features sorted by their rank:")
print(sorted(zip((round(x, 4) for x in clf.feature_importances_), X.columns)))
pd.DataFrame({'feature':X.columns, 'importance':clf.feature_importances_}).\
sort_values('importance',ascending=False).head()
sns.boxplot(x=mov.category, y=mov.duration)
sns.boxplot(x=mov.category, y=mov.num_critic_for_reviews)
regjointplot = sns.jointplot(x="category", y="num_voted_users", data=mov, kind="reg")
We now start by normalizing the data. Given that our variables are on very different scales, analyzing them together may make our models somewhat inaccurate. By clustering, we will try to profile each movie into groups of similar movies across all the variables (characteristics) we are analyzing.
mov.var()
mov_norm = (mov - mov.mean()) / (mov.max() - mov.min())
mov_norm.head()
from scipy.spatial.distance import cdist
K = range(1, 10)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=1)
    kmeans.fit(mov_norm)
    meandistortions.append(sum(np.min(cdist(mov_norm, kmeans.cluster_centers_, 'euclidean'), axis=1)) / mov_norm.shape[0])
plt.plot(K, meandistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
import scikitplot as skplt
kmeans = KMeans(random_state=1)
skplt.cluster.plot_elbow_curve(kmeans, mov_norm, cluster_ranges=range(1, 10))
We use the elbow method as a statistical reference for the optimal number of clusters. It is still up to our own judgment to decide how many clusters are needed, but according to the graph, there is no significant change when going from 4 to 5 or from 5 to 6 clusters. This suggests the optimal number of clusters lies within this range.
k_means = KMeans(init='k-means++', n_clusters=4, random_state=0)
k_means.fit(mov_norm)
k_means.labels_
k_means.cluster_centers_
mov_norm1 = pd.DataFrame(k_means.labels_, columns = ['cluster'])
mov_norm1.head()
mov = mov.reset_index(drop=True)
mov_norm1 = mov_norm1.reset_index(drop=True)
mov_norm2 = mov.join(mov_norm1)
mov_norm2.head()
y = mov_norm2['cluster']
X = mov_norm2.drop(['category', 'imdb_score', 'cluster'], axis=1)
from sklearn.ensemble import RandomForestClassifier
mov_cmodel = RandomForestClassifier(n_estimators=20)
mov_cmodel = mov_cmodel.fit(X, y)
# score the model trained on the cluster labels (the original cell scored the earlier `clf` by mistake)
mov_cmodel.score(X, y)
print(metrics.accuracy_score(y, mov_cmodel.predict(X)))  # overall accuracy
print(metrics.confusion_matrix(y, mov_cmodel.predict(X)))
print(metrics.classification_report(y, mov_cmodel.predict(X)))
Unfortunately, the accuracy reported here was just 15%. Note, though, that this run scored the earlier `clf` (which was trained to predict `category`) against the cluster labels rather than the freshly trained `mov_cmodel`, so the figure says little about how predictable the clusters actually are.
mov_norm2.groupby('cluster').mean()
mov_norm2.groupby('cluster')['imdb_score'].mean().sort_values(ascending=False)
mov_norm2.groupby('cluster')['imdb_score'].mean().sort_values(ascending=False).plot(kind='bar')
mov_norm2.groupby('cluster')['duration'].mean().sort_values(ascending=False)
mov_norm2.groupby('cluster')['duration'].mean().sort_values(ascending=False).plot(kind='bar')
mov_norm2.groupby('cluster')['num_critic_for_reviews'].mean().sort_values(ascending=False)
mov_norm2.groupby('cluster')['num_critic_for_reviews'].mean().sort_values(ascending=False).plot(kind='bar')
mov_norm2.groupby('cluster')['num_voted_users'].mean().sort_values(ascending=False)
mov_norm2.groupby('cluster')['num_voted_users'].mean().sort_values(ascending=False).plot(kind='bar')
We run into an odd situation: clusters 0 and 3 have very close average IMDB scores, and some variables favor one while others favor the other.
Since cluster 0 has the higher average IMDB score, this may mean that movies averaging around 126 minutes, roughly 256 critic reviews, and about 294,000 voting users get higher IMDB scores.
After going through all these analyses, we can conclude that a few variables ranked higher than all others in almost every test or model:
- duration
- num_critic_for_reviews
- num_voted_users
These variables, along with a few others like genre (Drama, for example, correlates relatively well with our IMDB scores), can capture a good deal of what is needed to identify high-ranking movies. Our feature selection models confirm these rankings, and with the help of a few less impactful variables, our models can reach high accuracy scores.
Among our 3 categories:
- Regression
- Classification
- Clustering
... these are the best models created.
Regression (all values are in-sample; MSE is in squared score units, not a percentage):
- Multiple Regression:
- MSE: 0.75
- R-Squared: 32.5%
- Ridge:
- MSE: 0.75
- R-Squared: 32.5%
- Lasso:
- MSE: 0.75
- R-Squared: 32.5%
- Random Forest:
- MSE: 0.10
- R-Squared: 90.9%
Classification:
- Logistic Regression: Accuracy: 65%
- Decision Tree (pruned): Accuracy: 71.8%
- KNN: Accuracy: 52%
- Random Forest Classifier
Clustering:
- Random Forest Classifier on cluster labels: Accuracy: 15% (as reported above)
According to all my analysis, in order for a movie to score between 8 and 10 on IMDB's scale, producers should mostly focus on creating movies that are, in a way, "for the customer." Following trends is a safe route; going against them wouldn't necessarily prevent a good score, but the most influential variables are driven by critics and by votes from customers on the IMDB website. Not even budget or profit is as influential as how the customer perceives and rates the movie, though every effort should still be made on those fronts too. One of the most influential variables that can be controlled before release is the movie's duration: movies averaging around 123 minutes tend to receive high IMDB scores. The ranking could be improved if its scale considered variables the producers and developers actually control. What I mean is that the scale is driven mostly by how the audience reacts rather than by what companies do themselves to earn that reaction, for example budget, or likes on posters and releases. Even though those variables are present here, the scale weighs them far less than customer reviews.